MACHINE LEARNING

Exploratory Data Analysis¶

José Carlos Monescillo Calzado¶

  • Objetivo¶

El objetivo principal de esta práctica es emplear técnicas de Análisis Exploratorio de Datos (EDA) para comprender a fondo un conjunto de datos relacionado con solicitudes de préstamos. Este análisis busca identificar patrones y relaciones entre las variables que permitan predecir el comportamiento de los clientes respecto al cumplimiento o incumplimiento de sus pagos. En este sentido, se evaluarán tanto variables demográficas como características financieras de los solicitantes, para determinar su influencia en la capacidad de devolver un préstamo. Este objetivo incluye el análisis detallado de posibles correlaciones entre las variables, la imputación de valores nulos, la transformación de datos categóricos y numéricos, y el escalado de las características según sea necesario.

En última instancia, el propósito es responder a la pregunta clave de negocio planteada: "¿Hay algún tipo de cliente más propenso a no devolver un préstamo?". Esto servirá como base para proponer estrategias que puedan mejorar la evaluación de riesgo y reducir las tasas de incumplimiento en futuras decisiones de crédito.

  • Pasos¶

  1. Exploración general de los datos: se analizan la distribución de los datos y las relaciones entre las distintas variables en relación a la variable objetivo.
  2. Correlaciones, valores nulos y outliers: se evaluan las correlaciones entre variables, se tratan los valores faltantes y los outliers. Se elaboran estrategias para su eliminación o tratamiento.
  3. Codificación y escalado: se codifican las variables categóricas para poderlas emplearlas en el modelo. Se implementan las técnicas de OneHotEncoding y TargetEncoding en función de los valores únicos de cada variable.
  • Caso de negocio¶

El caso de negocio gira en torno a un banco que se especializa en ofrecer distintos tipos de préstamos a sus clientes. Actualmente, la institución enfrenta el desafío de identificar a los solicitantes que representan un mayor riesgo de incumplimiento. Para abordar este problema, se busca analizar el comportamiento histórico de los clientes que no han cumplido con sus pagos y establecer criterios claros para mitigar los riesgos en futuras operaciones de préstamo.

El análisis exploratorio se convierte, en este contexto, en una herramienta clave para segmentar a los clientes según sus características y establecer perfiles de riesgo más precisos. Al responder preguntas como qué tipo de clientes o condiciones financieras están más relacionadas con el incumplimiento, el banco puede mejorar sus procesos de evaluación crediticia.

LIBRERIAS¶

In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px



pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 5000)

seed=123

Diccionario de datos¶

In [31]:
ruta = '../data_practica_EDA/columns_description.xlsx'
dictionary = pd.read_excel(ruta, index_col=0)
dictionary
Out[31]:
Table Row Description Special
1 application_data SK_ID_CURR ID of loan in our sample NaN
2 application_data TARGET Target variable (1 - client with payment diffi... NaN
5 application_data NAME_CONTRACT_TYPE Identification if loan is cash or revolving NaN
6 application_data CODE_GENDER Gender of the client NaN
7 application_data FLAG_OWN_CAR Flag if the client owns a car NaN
8 application_data FLAG_OWN_REALTY Flag if client owns a house or flat NaN
9 application_data CNT_CHILDREN Number of children the client has NaN
10 application_data AMT_INCOME_TOTAL Income of the client NaN
11 application_data AMT_CREDIT Credit amount of the loan NaN
12 application_data AMT_ANNUITY Loan annuity NaN
13 application_data AMT_GOODS_PRICE For consumer loans it is the price of the good... NaN
14 application_data NAME_TYPE_SUITE Who was accompanying client when he was applyi... NaN
15 application_data NAME_INCOME_TYPE Clients income type (businessman, working, mat... NaN
16 application_data NAME_EDUCATION_TYPE Level of highest education the client achieved NaN
17 application_data NAME_FAMILY_STATUS Family status of the client NaN
18 application_data NAME_HOUSING_TYPE What is the housing situation of the client (r... NaN
19 application_data REGION_POPULATION_RELATIVE Normalized population of region where client l... normalized
20 application_data DAYS_BIRTH Client's age in days at the time of application time only relative to the application
21 application_data DAYS_EMPLOYED How many days before the application the perso... time only relative to the application
22 application_data DAYS_REGISTRATION How many days before the application did clien... time only relative to the application
23 application_data DAYS_ID_PUBLISH How many days before the application did clien... time only relative to the application
24 application_data OWN_CAR_AGE Age of client's car NaN
25 application_data FLAG_MOBIL Did client provide mobile phone (1=YES, 0=NO) NaN
26 application_data FLAG_EMP_PHONE Did client provide work phone (1=YES, 0=NO) NaN
27 application_data FLAG_WORK_PHONE Did client provide home phone (1=YES, 0=NO) NaN
28 application_data FLAG_CONT_MOBILE Was mobile phone reachable (1=YES, 0=NO) NaN
29 application_data FLAG_PHONE Did client provide home phone (1=YES, 0=NO) NaN
30 application_data FLAG_EMAIL Did client provide email (1=YES, 0=NO) NaN
31 application_data OCCUPATION_TYPE What kind of occupation does the client have NaN
32 application_data CNT_FAM_MEMBERS How many family members does client have NaN
33 application_data REGION_RATING_CLIENT Our rating of the region where client lives (1... NaN
34 application_data REGION_RATING_CLIENT_W_CITY Our rating of the region where client lives wi... NaN
35 application_data WEEKDAY_APPR_PROCESS_START On which day of the week did the client apply ... NaN
36 application_data HOUR_APPR_PROCESS_START Approximately at what hour did the client appl... rounded
37 application_data REG_REGION_NOT_LIVE_REGION Flag if client's permanent address does not ma... NaN
38 application_data REG_REGION_NOT_WORK_REGION Flag if client's permanent address does not ma... NaN
39 application_data LIVE_REGION_NOT_WORK_REGION Flag if client's contact address does not matc... NaN
40 application_data REG_CITY_NOT_LIVE_CITY Flag if client's permanent address does not ma... NaN
41 application_data REG_CITY_NOT_WORK_CITY Flag if client's permanent address does not ma... NaN
42 application_data LIVE_CITY_NOT_WORK_CITY Flag if client's contact address does not matc... NaN
43 application_data ORGANIZATION_TYPE Type of organization where client works NaN
44 application_data EXT_SOURCE_1 Normalized score from external data source normalized
45 application_data EXT_SOURCE_2 Normalized score from external data source normalized
46 application_data EXT_SOURCE_3 Normalized score from external data source normalized
47 application_data APARTMENTS_AVG Normalized information about building where th... normalized
48 application_data BASEMENTAREA_AVG Normalized information about building where th... normalized
49 application_data YEARS_BEGINEXPLUATATION_AVG Normalized information about building where th... normalized
50 application_data YEARS_BUILD_AVG Normalized information about building where th... normalized
51 application_data COMMONAREA_AVG Normalized information about building where th... normalized
52 application_data ELEVATORS_AVG Normalized information about building where th... normalized
53 application_data ENTRANCES_AVG Normalized information about building where th... normalized
54 application_data FLOORSMAX_AVG Normalized information about building where th... normalized
55 application_data FLOORSMIN_AVG Normalized information about building where th... normalized
56 application_data LANDAREA_AVG Normalized information about building where th... normalized
57 application_data LIVINGAPARTMENTS_AVG Normalized information about building where th... normalized
58 application_data LIVINGAREA_AVG Normalized information about building where th... normalized
59 application_data NONLIVINGAPARTMENTS_AVG Normalized information about building where th... normalized
60 application_data NONLIVINGAREA_AVG Normalized information about building where th... normalized
61 application_data APARTMENTS_MODE Normalized information about building where th... normalized
62 application_data BASEMENTAREA_MODE Normalized information about building where th... normalized
63 application_data YEARS_BEGINEXPLUATATION_MODE Normalized information about building where th... normalized
64 application_data YEARS_BUILD_MODE Normalized information about building where th... normalized
65 application_data COMMONAREA_MODE Normalized information about building where th... normalized
66 application_data ELEVATORS_MODE Normalized information about building where th... normalized
67 application_data ENTRANCES_MODE Normalized information about building where th... normalized
68 application_data FLOORSMAX_MODE Normalized information about building where th... normalized
69 application_data FLOORSMIN_MODE Normalized information about building where th... normalized
70 application_data LANDAREA_MODE Normalized information about building where th... normalized
71 application_data LIVINGAPARTMENTS_MODE Normalized information about building where th... normalized
72 application_data LIVINGAREA_MODE Normalized information about building where th... normalized
73 application_data NONLIVINGAPARTMENTS_MODE Normalized information about building where th... normalized
74 application_data NONLIVINGAREA_MODE Normalized information about building where th... normalized
75 application_data APARTMENTS_MEDI Normalized information about building where th... normalized
76 application_data BASEMENTAREA_MEDI Normalized information about building where th... normalized
77 application_data YEARS_BEGINEXPLUATATION_MEDI Normalized information about building where th... normalized
78 application_data YEARS_BUILD_MEDI Normalized information about building where th... normalized
79 application_data COMMONAREA_MEDI Normalized information about building where th... normalized
80 application_data ELEVATORS_MEDI Normalized information about building where th... normalized
81 application_data ENTRANCES_MEDI Normalized information about building where th... normalized
82 application_data FLOORSMAX_MEDI Normalized information about building where th... normalized
83 application_data FLOORSMIN_MEDI Normalized information about building where th... normalized
84 application_data LANDAREA_MEDI Normalized information about building where th... normalized
85 application_data LIVINGAPARTMENTS_MEDI Normalized information about building where th... normalized
86 application_data LIVINGAREA_MEDI Normalized information about building where th... normalized
87 application_data NONLIVINGAPARTMENTS_MEDI Normalized information about building where th... normalized
88 application_data NONLIVINGAREA_MEDI Normalized information about building where th... normalized
89 application_data FONDKAPREMONT_MODE Normalized information about building where th... normalized
90 application_data HOUSETYPE_MODE Normalized information about building where th... normalized
91 application_data TOTALAREA_MODE Normalized information about building where th... normalized
92 application_data WALLSMATERIAL_MODE Normalized information about building where th... normalized
93 application_data EMERGENCYSTATE_MODE Normalized information about building where th... normalized
94 application_data OBS_30_CNT_SOCIAL_CIRCLE How many observation of client's social surrou... NaN
95 application_data DEF_30_CNT_SOCIAL_CIRCLE How many observation of client's social surrou... NaN
96 application_data OBS_60_CNT_SOCIAL_CIRCLE How many observation of client's social surrou... NaN
97 application_data DEF_60_CNT_SOCIAL_CIRCLE How many observation of client's social surrou... NaN
98 application_data DAYS_LAST_PHONE_CHANGE How many days before application did client ch... NaN
99 application_data FLAG_DOCUMENT_2 Did client provide document 2 NaN
100 application_data FLAG_DOCUMENT_3 Did client provide document 3 NaN
101 application_data FLAG_DOCUMENT_4 Did client provide document 4 NaN
102 application_data FLAG_DOCUMENT_5 Did client provide document 5 NaN
103 application_data FLAG_DOCUMENT_6 Did client provide document 6 NaN
104 application_data FLAG_DOCUMENT_7 Did client provide document 7 NaN
105 application_data FLAG_DOCUMENT_8 Did client provide document 8 NaN
106 application_data FLAG_DOCUMENT_9 Did client provide document 9 NaN
107 application_data FLAG_DOCUMENT_10 Did client provide document 10 NaN
108 application_data FLAG_DOCUMENT_11 Did client provide document 11 NaN
109 application_data FLAG_DOCUMENT_12 Did client provide document 12 NaN
110 application_data FLAG_DOCUMENT_13 Did client provide document 13 NaN
111 application_data FLAG_DOCUMENT_14 Did client provide document 14 NaN
112 application_data FLAG_DOCUMENT_15 Did client provide document 15 NaN
113 application_data FLAG_DOCUMENT_16 Did client provide document 16 NaN
114 application_data FLAG_DOCUMENT_17 Did client provide document 17 NaN
115 application_data FLAG_DOCUMENT_18 Did client provide document 18 NaN
116 application_data FLAG_DOCUMENT_19 Did client provide document 19 NaN
117 application_data FLAG_DOCUMENT_20 Did client provide document 20 NaN
118 application_data FLAG_DOCUMENT_21 Did client provide document 21 NaN
119 application_data AMT_REQ_CREDIT_BUREAU_HOUR Number of enquiries to Credit Bureau about the... NaN
120 application_data AMT_REQ_CREDIT_BUREAU_DAY Number of enquiries to Credit Bureau about the... NaN
121 application_data AMT_REQ_CREDIT_BUREAU_WEEK Number of enquiries to Credit Bureau about the... NaN
122 application_data AMT_REQ_CREDIT_BUREAU_MON Number of enquiries to Credit Bureau about the... NaN
123 application_data AMT_REQ_CREDIT_BUREAU_QRT Number of enquiries to Credit Bureau about the... NaN
124 application_data AMT_REQ_CREDIT_BUREAU_YEAR Number of enquiries to Credit Bureau about the... NaN
176 previous_application.csv SK_ID_PREV ID of previous credit in Home credit related t... hashed
177 previous_application.csv SK_ID_CURR ID of loan in our sample hashed
178 previous_application.csv NAME_CONTRACT_TYPE Contract product type (Cash loan, consumer loa... NaN
179 previous_application.csv AMT_ANNUITY Annuity of previous application NaN
180 previous_application.csv AMT_APPLICATION For how much credit did client ask on the prev... NaN
181 previous_application.csv AMT_CREDIT Final credit amount on the previous applicatio... NaN
182 previous_application.csv AMT_DOWN_PAYMENT Down payment on the previous application NaN
183 previous_application.csv AMT_GOODS_PRICE Goods price of good that client asked for (if ... NaN
184 previous_application.csv WEEKDAY_APPR_PROCESS_START On which day of the week did the client apply ... NaN
185 previous_application.csv HOUR_APPR_PROCESS_START Approximately at what day hour did the client ... rounded
186 previous_application.csv FLAG_LAST_APPL_PER_CONTRACT Flag if it was last application for the previo... NaN
187 previous_application.csv NFLAG_LAST_APPL_IN_DAY Flag if the application was the last applicati... NaN
188 previous_application.csv NFLAG_MICRO_CASH Flag Micro finance loan NaN
189 previous_application.csv RATE_DOWN_PAYMENT Down payment rate normalized on previous credit normalized
190 previous_application.csv RATE_INTEREST_PRIMARY Interest rate normalized on previous credit normalized
191 previous_application.csv RATE_INTEREST_PRIVILEGED Interest rate normalized on previous credit normalized
192 previous_application.csv NAME_CASH_LOAN_PURPOSE Purpose of the cash loan NaN
193 previous_application.csv NAME_CONTRACT_STATUS Contract status (approved, cancelled, ...) of ... NaN
194 previous_application.csv DAYS_DECISION Relative to current application when was the d... time only relative to the application
195 previous_application.csv NAME_PAYMENT_TYPE Payment method that client chose to pay for th... NaN
196 previous_application.csv CODE_REJECT_REASON Why was the previous application rejected NaN
197 previous_application.csv NAME_TYPE_SUITE Who accompanied client when applying for the p... NaN
198 previous_application.csv NAME_CLIENT_TYPE Was the client old or new client when applying... NaN
199 previous_application.csv NAME_GOODS_CATEGORY What kind of goods did the client apply for in... NaN
200 previous_application.csv NAME_PORTFOLIO Was the previous application for CASH, POS, CA... NaN
201 previous_application.csv NAME_PRODUCT_TYPE Was the previous application x-sell o walk-in NaN
202 previous_application.csv CHANNEL_TYPE Through which channel we acquired the client o... NaN
203 previous_application.csv SELLERPLACE_AREA Selling area of seller place of the previous a... NaN
204 previous_application.csv NAME_SELLER_INDUSTRY The industry of the seller NaN
205 previous_application.csv CNT_PAYMENT Term of previous credit at application of the ... NaN
206 previous_application.csv NAME_YIELD_GROUP Grouped interest rate into small medium and hi... grouped
207 previous_application.csv PRODUCT_COMBINATION Detailed product combination of the previous a... NaN
208 previous_application.csv DAYS_FIRST_DRAWING Relative to application date of current applic... time only relative to the application
209 previous_application.csv DAYS_FIRST_DUE Relative to application date of current applic... time only relative to the application
210 previous_application.csv DAYS_LAST_DUE_1ST_VERSION Relative to application date of current applic... time only relative to the application
211 previous_application.csv DAYS_LAST_DUE Relative to application date of current applic... time only relative to the application
212 previous_application.csv DAYS_TERMINATION Relative to application date of current applic... time only relative to the application
213 previous_application.csv NFLAG_INSURED_ON_APPROVAL Did the client requested insurance during the ... NaN

FUNCIONES¶

A partir del segundo notebook, todas las funciones que se emplearán estarán importadas desde el archivo "funciones_auxiliares".

In [32]:
def dame_variables_categoricas(dataset=None, max_unicos=100): 

    """ Se indica el valor de valores únicos máximo(max_unicos) como un parametro de manera que si lo escribimos a lo largo de la función
    más de una vez solo se tenga que cambiar en una línea.
    
    Inputs:
        -- dataset: pandas dataframe que contiene los datos
        -- max_unicos: máximo de unicos que se desean por columna(variable)
        """




    if dataset is None:
        print(u'\nFaltan argumentos por pasar a la función')
        return 1

    lista_variables_categoricas = []
    other = []

    for i in dataset.columns:
        unicos = int(len(np.unique(dataset[i].dropna(axis=0, how='all'))))
        if dataset[i].dtype in ['object', 'category'] or unicos < max_unicos: # A traves de la lista simplificamos el código.
                lista_variables_categoricas.append(i)
        else:
                other.append(i)

    return lista_variables_categoricas, other
In [33]:
pd_loan = pd.read_csv('../data_practica_EDA/application_data.csv').set_index('SK_ID_CURR')
pd_loan
Out[33]:
TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR
100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.0220 0.0198 0.0 0.0000 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.0000 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.0790 0.0554 0.0 0.0000 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.0100 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 Government NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
456251 0 Cash loans M N N 0 157500.0 254700.0 27558.0 225000.0 Unaccompanied Working Secondary / secondary special Separated With parents 0.032561 -9327 -236 -8456.0 -1982 NaN 1 1 0 1 0 0 Sales staff 1.0 1 1 THURSDAY 15 0 0 0 0 0 0 Services 0.145570 0.681632 NaN 0.2021 0.0887 0.9876 0.8300 0.0202 0.22 0.1034 0.6042 0.2708 0.0594 0.1484 0.1965 0.0753 0.1095 0.1008 0.0172 0.9782 0.7125 0.0172 0.0806 0.0345 0.4583 0.0417 0.0094 0.0882 0.0853 0.0 0.0125 0.2040 0.0887 0.9876 0.8323 0.0203 0.22 0.1034 0.6042 0.2708 0.0605 0.1509 0.2001 0.0757 0.1118 reg oper account block of flats 0.2898 Stone, brick No 0.0 0.0 0.0 0.0 -273.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
456252 0 Cash loans F N Y 0 72000.0 269550.0 12001.5 225000.0 Unaccompanied Pensioner Secondary / secondary special Widow House / apartment 0.025164 -20775 365243 -4388.0 -4090 NaN 1 0 0 1 1 0 NaN 1.0 2 2 MONDAY 8 0 0 0 0 0 0 XNA NaN 0.115992 NaN 0.0247 0.0435 0.9727 0.6260 0.0022 0.00 0.1034 0.0833 0.1250 0.0579 0.0202 0.0257 0.0000 0.0000 0.0252 0.0451 0.9727 0.6406 0.0022 0.0000 0.1034 0.0833 0.1250 0.0592 0.0220 0.0267 0.0 0.0000 0.0250 0.0435 0.9727 0.6310 0.0022 0.00 0.1034 0.0833 0.1250 0.0589 0.0205 0.0261 0.0000 0.0000 reg oper account block of flats 0.0214 Stone, brick No 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
456253 0 Cash loans F N Y 0 153000.0 677664.0 29979.0 585000.0 Unaccompanied Working Higher education Separated House / apartment 0.005002 -14966 -7921 -6737.0 -5150 NaN 1 1 0 1 0 1 Managers 1.0 3 3 THURSDAY 9 0 0 0 0 1 1 School 0.744026 0.535722 0.218859 0.1031 0.0862 0.9816 0.7484 0.0123 0.00 0.2069 0.1667 0.2083 NaN 0.0841 0.9279 0.0000 0.0000 0.1050 0.0894 0.9816 0.7583 0.0124 0.0000 0.2069 0.1667 0.2083 NaN 0.0918 0.9667 0.0 0.0000 0.1041 0.0862 0.9816 0.7518 0.0124 0.00 0.2069 0.1667 0.2083 NaN 0.0855 0.9445 0.0000 0.0000 reg oper account block of flats 0.7970 Panel No 6.0 0.0 6.0 0.0 -1909.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1.0 0.0 0.0 1.0 0.0 1.0
456254 1 Cash loans F N Y 0 171000.0 370107.0 20205.0 319500.0 Unaccompanied Commercial associate Secondary / secondary special Married House / apartment 0.005313 -11961 -4786 -2562.0 -931 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 9 0 0 0 1 1 0 Business Entity Type 1 NaN 0.514163 0.661024 0.0124 NaN 0.9771 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0061 NaN NaN 0.0126 NaN 0.9772 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0063 NaN NaN 0.0125 NaN 0.9771 NaN NaN NaN 0.0690 0.0417 NaN NaN NaN 0.0062 NaN NaN NaN block of flats 0.0086 Stone, brick No 0.0 0.0 0.0 0.0 -322.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
456255 0 Cash loans F N N 0 157500.0 675000.0 49117.5 675000.0 Unaccompanied Commercial associate Higher education Married House / apartment 0.046220 -16856 -1262 -5128.0 -410 NaN 1 1 1 1 1 0 Laborers 2.0 1 1 THURSDAY 20 0 0 0 0 1 1 Business Entity Type 3 0.734460 0.708569 0.113922 0.0742 0.0526 0.9881 NaN 0.0176 0.08 0.0690 0.3750 NaN NaN NaN 0.0791 NaN 0.0000 0.0756 0.0546 0.9881 NaN 0.0178 0.0806 0.0690 0.3750 NaN NaN NaN 0.0824 NaN 0.0000 0.0749 0.0526 0.9881 NaN 0.0177 0.08 0.0690 0.3750 NaN NaN NaN 0.0805 NaN 0.0000 NaN block of flats 0.0718 Panel No 0.0 0.0 0.0 0.0 -787.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 2.0 0.0 1.0

307511 rows × 121 columns

Dimensión del dataset

In [34]:
print(pd_loan.shape, pd_loan.drop_duplicates().shape)
(307511, 121) (307511, 121)

Tras la ejecución de este código en el que imprimimos la dimensión (columnas y filas) del dataset antes y después de la comprobación de duplicados, se observa que no se ha eliminado ningún registro, lo que indica que no hay duplicados. El dataframe está compuesto de 307.511 filas y 121 columnas.

In [35]:
pd_loan.dtypes.to_dict()
Out[35]:
{'TARGET': dtype('int64'),
 'NAME_CONTRACT_TYPE': dtype('O'),
 'CODE_GENDER': dtype('O'),
 'FLAG_OWN_CAR': dtype('O'),
 'FLAG_OWN_REALTY': dtype('O'),
 'CNT_CHILDREN': dtype('int64'),
 'AMT_INCOME_TOTAL': dtype('float64'),
 'AMT_CREDIT': dtype('float64'),
 'AMT_ANNUITY': dtype('float64'),
 'AMT_GOODS_PRICE': dtype('float64'),
 'NAME_TYPE_SUITE': dtype('O'),
 'NAME_INCOME_TYPE': dtype('O'),
 'NAME_EDUCATION_TYPE': dtype('O'),
 'NAME_FAMILY_STATUS': dtype('O'),
 'NAME_HOUSING_TYPE': dtype('O'),
 'REGION_POPULATION_RELATIVE': dtype('float64'),
 'DAYS_BIRTH': dtype('int64'),
 'DAYS_EMPLOYED': dtype('int64'),
 'DAYS_REGISTRATION': dtype('float64'),
 'DAYS_ID_PUBLISH': dtype('int64'),
 'OWN_CAR_AGE': dtype('float64'),
 'FLAG_MOBIL': dtype('int64'),
 'FLAG_EMP_PHONE': dtype('int64'),
 'FLAG_WORK_PHONE': dtype('int64'),
 'FLAG_CONT_MOBILE': dtype('int64'),
 'FLAG_PHONE': dtype('int64'),
 'FLAG_EMAIL': dtype('int64'),
 'OCCUPATION_TYPE': dtype('O'),
 'CNT_FAM_MEMBERS': dtype('float64'),
 'REGION_RATING_CLIENT': dtype('int64'),
 'REGION_RATING_CLIENT_W_CITY': dtype('int64'),
 'WEEKDAY_APPR_PROCESS_START': dtype('O'),
 'HOUR_APPR_PROCESS_START': dtype('int64'),
 'REG_REGION_NOT_LIVE_REGION': dtype('int64'),
 'REG_REGION_NOT_WORK_REGION': dtype('int64'),
 'LIVE_REGION_NOT_WORK_REGION': dtype('int64'),
 'REG_CITY_NOT_LIVE_CITY': dtype('int64'),
 'REG_CITY_NOT_WORK_CITY': dtype('int64'),
 'LIVE_CITY_NOT_WORK_CITY': dtype('int64'),
 'ORGANIZATION_TYPE': dtype('O'),
 'EXT_SOURCE_1': dtype('float64'),
 'EXT_SOURCE_2': dtype('float64'),
 'EXT_SOURCE_3': dtype('float64'),
 'APARTMENTS_AVG': dtype('float64'),
 'BASEMENTAREA_AVG': dtype('float64'),
 'YEARS_BEGINEXPLUATATION_AVG': dtype('float64'),
 'YEARS_BUILD_AVG': dtype('float64'),
 'COMMONAREA_AVG': dtype('float64'),
 'ELEVATORS_AVG': dtype('float64'),
 'ENTRANCES_AVG': dtype('float64'),
 'FLOORSMAX_AVG': dtype('float64'),
 'FLOORSMIN_AVG': dtype('float64'),
 'LANDAREA_AVG': dtype('float64'),
 'LIVINGAPARTMENTS_AVG': dtype('float64'),
 'LIVINGAREA_AVG': dtype('float64'),
 'NONLIVINGAPARTMENTS_AVG': dtype('float64'),
 'NONLIVINGAREA_AVG': dtype('float64'),
 'APARTMENTS_MODE': dtype('float64'),
 'BASEMENTAREA_MODE': dtype('float64'),
 'YEARS_BEGINEXPLUATATION_MODE': dtype('float64'),
 'YEARS_BUILD_MODE': dtype('float64'),
 'COMMONAREA_MODE': dtype('float64'),
 'ELEVATORS_MODE': dtype('float64'),
 'ENTRANCES_MODE': dtype('float64'),
 'FLOORSMAX_MODE': dtype('float64'),
 'FLOORSMIN_MODE': dtype('float64'),
 'LANDAREA_MODE': dtype('float64'),
 'LIVINGAPARTMENTS_MODE': dtype('float64'),
 'LIVINGAREA_MODE': dtype('float64'),
 'NONLIVINGAPARTMENTS_MODE': dtype('float64'),
 'NONLIVINGAREA_MODE': dtype('float64'),
 'APARTMENTS_MEDI': dtype('float64'),
 'BASEMENTAREA_MEDI': dtype('float64'),
 'YEARS_BEGINEXPLUATATION_MEDI': dtype('float64'),
 'YEARS_BUILD_MEDI': dtype('float64'),
 'COMMONAREA_MEDI': dtype('float64'),
 'ELEVATORS_MEDI': dtype('float64'),
 'ENTRANCES_MEDI': dtype('float64'),
 'FLOORSMAX_MEDI': dtype('float64'),
 'FLOORSMIN_MEDI': dtype('float64'),
 'LANDAREA_MEDI': dtype('float64'),
 'LIVINGAPARTMENTS_MEDI': dtype('float64'),
 'LIVINGAREA_MEDI': dtype('float64'),
 'NONLIVINGAPARTMENTS_MEDI': dtype('float64'),
 'NONLIVINGAREA_MEDI': dtype('float64'),
 'FONDKAPREMONT_MODE': dtype('O'),
 'HOUSETYPE_MODE': dtype('O'),
 'TOTALAREA_MODE': dtype('float64'),
 'WALLSMATERIAL_MODE': dtype('O'),
 'EMERGENCYSTATE_MODE': dtype('O'),
 'OBS_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
 'DEF_30_CNT_SOCIAL_CIRCLE': dtype('float64'),
 'OBS_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
 'DEF_60_CNT_SOCIAL_CIRCLE': dtype('float64'),
 'DAYS_LAST_PHONE_CHANGE': dtype('float64'),
 'FLAG_DOCUMENT_2': dtype('int64'),
 'FLAG_DOCUMENT_3': dtype('int64'),
 'FLAG_DOCUMENT_4': dtype('int64'),
 'FLAG_DOCUMENT_5': dtype('int64'),
 'FLAG_DOCUMENT_6': dtype('int64'),
 'FLAG_DOCUMENT_7': dtype('int64'),
 'FLAG_DOCUMENT_8': dtype('int64'),
 'FLAG_DOCUMENT_9': dtype('int64'),
 'FLAG_DOCUMENT_10': dtype('int64'),
 'FLAG_DOCUMENT_11': dtype('int64'),
 'FLAG_DOCUMENT_12': dtype('int64'),
 'FLAG_DOCUMENT_13': dtype('int64'),
 'FLAG_DOCUMENT_14': dtype('int64'),
 'FLAG_DOCUMENT_15': dtype('int64'),
 'FLAG_DOCUMENT_16': dtype('int64'),
 'FLAG_DOCUMENT_17': dtype('int64'),
 'FLAG_DOCUMENT_18': dtype('int64'),
 'FLAG_DOCUMENT_19': dtype('int64'),
 'FLAG_DOCUMENT_20': dtype('int64'),
 'FLAG_DOCUMENT_21': dtype('int64'),
 'AMT_REQ_CREDIT_BUREAU_HOUR': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_DAY': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_WEEK': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_MON': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_QRT': dtype('float64'),
 'AMT_REQ_CREDIT_BUREAU_YEAR': dtype('float64')}

Se observa cada columna asociada a una variable y el tipo de variable que es, object o numéricas.

Exploración de la variable objetivo¶

In [36]:
pd_plot_target = pd_loan['TARGET']\
        .value_counts(normalize=True)\
        .mul(100).rename('percent').reset_index()

pd_plot_target_conteo = pd_loan['TARGET'].value_counts().reset_index()
pd_plot_target_pc = pd.merge(pd_plot_target, pd_plot_target_conteo, on=['TARGET'], how='inner')
pd_plot_target_pc
Out[36]:
TARGET percent count
0 0 91.927118 282686
1 1 8.072882 24825
In [ ]:
import plotly.io as pio
pio.renderers.default = 'notebook' # Tengo problemas para visualizar el gráfico en html pero con esta linea se arregla

fig = px.bar(pd_plot_target_pc, x="TARGET", y='percent', text='percent', labels={"TARGET": "Estado de Pago (TARGET)", "percent": "Porcentaje"})
fig.show()

La variable 'TARGET' nos muestra si se ha producido retraso en el pago de las cuotas del préstamo. El 91.93% de los clientes han pagado las cuotas del prestamo sin retraso mientras que el 8.07% han tenido dificultades para pagar. En valores absolutos son 282686 quienes pagaron a tiempo frente a 24825 personas que se retrasaron en el pago. Se observa que la variable está desbalanceada.

Selección de threshold (umbral) por filas y columnas para suprimir valores missing¶

In [38]:
pd_series_null_columns = pd_loan.isnull().sum().sort_values(ascending=False)
pd_series_null_rows = pd_loan.isnull().sum(axis=1).sort_values(ascending=False)
print(pd_series_null_columns.shape, pd_series_null_rows.shape)

pd_null_columnas = pd.DataFrame(pd_series_null_columns, columns=['nulos_columnas'])     
pd_null_filas = pd.DataFrame(pd_series_null_rows, columns=['nulos_filas'])  
pd_null_filas['target'] = pd_loan['TARGET'].copy()
pd_null_columnas['porcentaje_columnas'] = pd_null_columnas['nulos_columnas']/pd_loan.shape[0]
pd_null_filas['porcentaje_filas']= pd_null_filas['nulos_filas']/pd_loan.shape[1]
(121,) (307511,)
In [39]:
pd_loan.shape
Out[39]:
(307511, 121)
In [40]:
pd_null_columnas
Out[40]:
nulos_columnas porcentaje_columnas
COMMONAREA_AVG 214865 0.698723
COMMONAREA_MODE 214865 0.698723
COMMONAREA_MEDI 214865 0.698723
NONLIVINGAPARTMENTS_AVG 213514 0.694330
NONLIVINGAPARTMENTS_MODE 213514 0.694330
NONLIVINGAPARTMENTS_MEDI 213514 0.694330
FONDKAPREMONT_MODE 210295 0.683862
LIVINGAPARTMENTS_MEDI 210199 0.683550
LIVINGAPARTMENTS_AVG 210199 0.683550
LIVINGAPARTMENTS_MODE 210199 0.683550
FLOORSMIN_AVG 208642 0.678486
FLOORSMIN_MODE 208642 0.678486
FLOORSMIN_MEDI 208642 0.678486
YEARS_BUILD_AVG 204488 0.664978
YEARS_BUILD_MEDI 204488 0.664978
YEARS_BUILD_MODE 204488 0.664978
OWN_CAR_AGE 202929 0.659908
LANDAREA_MEDI 182590 0.593767
LANDAREA_AVG 182590 0.593767
LANDAREA_MODE 182590 0.593767
BASEMENTAREA_MEDI 179943 0.585160
BASEMENTAREA_AVG 179943 0.585160
BASEMENTAREA_MODE 179943 0.585160
EXT_SOURCE_1 173378 0.563811
NONLIVINGAREA_AVG 169682 0.551792
NONLIVINGAREA_MODE 169682 0.551792
NONLIVINGAREA_MEDI 169682 0.551792
ELEVATORS_MODE 163891 0.532960
ELEVATORS_AVG 163891 0.532960
ELEVATORS_MEDI 163891 0.532960
WALLSMATERIAL_MODE 156341 0.508408
APARTMENTS_AVG 156061 0.507497
APARTMENTS_MEDI 156061 0.507497
APARTMENTS_MODE 156061 0.507497
ENTRANCES_AVG 154828 0.503488
ENTRANCES_MEDI 154828 0.503488
ENTRANCES_MODE 154828 0.503488
LIVINGAREA_MEDI 154350 0.501933
LIVINGAREA_MODE 154350 0.501933
LIVINGAREA_AVG 154350 0.501933
HOUSETYPE_MODE 154297 0.501761
FLOORSMAX_MEDI 153020 0.497608
FLOORSMAX_MODE 153020 0.497608
FLOORSMAX_AVG 153020 0.497608
YEARS_BEGINEXPLUATATION_MEDI 150007 0.487810
YEARS_BEGINEXPLUATATION_MODE 150007 0.487810
YEARS_BEGINEXPLUATATION_AVG 150007 0.487810
TOTALAREA_MODE 148431 0.482685
EMERGENCYSTATE_MODE 145755 0.473983
OCCUPATION_TYPE 96391 0.313455
EXT_SOURCE_3 60965 0.198253
AMT_REQ_CREDIT_BUREAU_WEEK 41519 0.135016
AMT_REQ_CREDIT_BUREAU_HOUR 41519 0.135016
AMT_REQ_CREDIT_BUREAU_MON 41519 0.135016
AMT_REQ_CREDIT_BUREAU_QRT 41519 0.135016
AMT_REQ_CREDIT_BUREAU_DAY 41519 0.135016
AMT_REQ_CREDIT_BUREAU_YEAR 41519 0.135016
NAME_TYPE_SUITE 1292 0.004201
DEF_30_CNT_SOCIAL_CIRCLE 1021 0.003320
OBS_60_CNT_SOCIAL_CIRCLE 1021 0.003320
OBS_30_CNT_SOCIAL_CIRCLE 1021 0.003320
DEF_60_CNT_SOCIAL_CIRCLE 1021 0.003320
EXT_SOURCE_2 660 0.002146
AMT_GOODS_PRICE 278 0.000904
AMT_ANNUITY 12 0.000039
CNT_FAM_MEMBERS 2 0.000007
DAYS_LAST_PHONE_CHANGE 1 0.000003
AMT_INCOME_TOTAL 0 0.000000
FLAG_DOCUMENT_8 0 0.000000
CODE_GENDER 0 0.000000
FLAG_OWN_CAR 0 0.000000
FLAG_OWN_REALTY 0 0.000000
FLAG_DOCUMENT_2 0 0.000000
FLAG_DOCUMENT_3 0 0.000000
FLAG_DOCUMENT_4 0 0.000000
FLAG_DOCUMENT_5 0 0.000000
FLAG_DOCUMENT_6 0 0.000000
FLAG_DOCUMENT_7 0 0.000000
FLAG_DOCUMENT_9 0 0.000000
FLAG_DOCUMENT_21 0 0.000000
FLAG_DOCUMENT_10 0 0.000000
FLAG_DOCUMENT_11 0 0.000000
CNT_CHILDREN 0 0.000000
FLAG_DOCUMENT_13 0 0.000000
FLAG_DOCUMENT_14 0 0.000000
FLAG_DOCUMENT_15 0 0.000000
FLAG_DOCUMENT_16 0 0.000000
FLAG_DOCUMENT_17 0 0.000000
FLAG_DOCUMENT_18 0 0.000000
FLAG_DOCUMENT_19 0 0.000000
FLAG_DOCUMENT_20 0 0.000000
FLAG_DOCUMENT_12 0 0.000000
AMT_CREDIT 0 0.000000
ORGANIZATION_TYPE 0 0.000000
NAME_INCOME_TYPE 0 0.000000
LIVE_CITY_NOT_WORK_CITY 0 0.000000
NAME_CONTRACT_TYPE 0 0.000000
REG_CITY_NOT_WORK_CITY 0 0.000000
REG_CITY_NOT_LIVE_CITY 0 0.000000
LIVE_REGION_NOT_WORK_REGION 0 0.000000
REG_REGION_NOT_WORK_REGION 0 0.000000
REG_REGION_NOT_LIVE_REGION 0 0.000000
HOUR_APPR_PROCESS_START 0 0.000000
WEEKDAY_APPR_PROCESS_START 0 0.000000
REGION_RATING_CLIENT_W_CITY 0 0.000000
REGION_RATING_CLIENT 0 0.000000
FLAG_EMAIL 0 0.000000
FLAG_PHONE 0 0.000000
FLAG_CONT_MOBILE 0 0.000000
FLAG_WORK_PHONE 0 0.000000
FLAG_EMP_PHONE 0 0.000000
FLAG_MOBIL 0 0.000000
DAYS_ID_PUBLISH 0 0.000000
DAYS_REGISTRATION 0 0.000000
DAYS_EMPLOYED 0 0.000000
DAYS_BIRTH 0 0.000000
REGION_POPULATION_RELATIVE 0 0.000000
NAME_HOUSING_TYPE 0 0.000000
NAME_FAMILY_STATUS 0 0.000000
NAME_EDUCATION_TYPE 0 0.000000
TARGET 0 0.000000
In [41]:
threshold=0.80
list_vars_not_null = list(pd_null_columnas[pd_null_columnas['porcentaje_columnas']<threshold].index)
pd_loan_filter_null = pd_loan.loc[:, list_vars_not_null]
pd_loan_filter_null.shape
Out[41]:
(307511, 121)

Se utiliza un umbral de 0.85 para eliminar columnas con más del 85% de valores nulos almacenando la totalidad de las columnas.

Aunque un umbral menor (como 0.6) podría ser más estricto, se decidió conservar más columnas, ya que las proporciones de nulos no son consideradas excesivamente altas.

In [42]:
pd_null_filas
Out[42]:
nulos_filas target porcentaje_filas
SK_ID_CURR
315294 61 0 0.504132
255145 61 0 0.504132
329262 61 0 0.504132
235599 61 0 0.504132
412312 61 0 0.504132
... ... ... ...
250717 0 0 0.000000
250702 0 0 0.000000
250697 0 0 0.000000
250680 0 0 0.000000
278202 0 0 0.000000

307511 rows × 3 columns

La fila con mayor porcentaje de valores missing obtiene un 50% compartiendo valores con varias filas más.

In [43]:
list_cat_vars, other = dame_variables_categoricas(dataset=pd_loan_filter_null)
pd_loan_filter_null[list_cat_vars] = pd_loan_filter_null[list_cat_vars].astype("category")
pd_loan_filter_null[list_cat_vars].head()
Out[43]:
FONDKAPREMONT_MODE FLOORSMIN_MODE FLOORSMIN_MEDI OWN_CAR_AGE ELEVATORS_MODE ELEVATORS_MEDI WALLSMATERIAL_MODE ENTRANCES_MEDI ENTRANCES_MODE HOUSETYPE_MODE FLOORSMAX_MEDI FLOORSMAX_MODE EMERGENCYSTATE_MODE OCCUPATION_TYPE AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_YEAR NAME_TYPE_SUITE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE OBS_30_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE CNT_FAM_MEMBERS FLAG_DOCUMENT_8 CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_9 FLAG_DOCUMENT_21 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 CNT_CHILDREN FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_12 ORGANIZATION_TYPE NAME_INCOME_TYPE LIVE_CITY_NOT_WORK_CITY NAME_CONTRACT_TYPE REG_CITY_NOT_WORK_CITY REG_CITY_NOT_LIVE_CITY LIVE_REGION_NOT_WORK_REGION REG_REGION_NOT_WORK_REGION REG_REGION_NOT_LIVE_REGION HOUR_APPR_PROCESS_START WEEKDAY_APPR_PROCESS_START REGION_RATING_CLIENT_W_CITY REGION_RATING_CLIENT FLAG_EMAIL FLAG_PHONE FLAG_CONT_MOBILE FLAG_WORK_PHONE FLAG_EMP_PHONE FLAG_MOBIL REGION_POPULATION_RELATIVE NAME_HOUSING_TYPE NAME_FAMILY_STATUS NAME_EDUCATION_TYPE TARGET
SK_ID_CURR
100002 reg oper account 0.1250 0.1250 NaN 0.0000 0.00 Stone, brick 0.0690 0.0690 block of flats 0.0833 0.0833 No Laborers 0.0 0.0 0.0 0.0 0.0 1.0 Unaccompanied 2.0 2.0 2.0 2.0 1.0 0 M N Y 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Business Entity Type 3 Working 0 Cash loans 0 0 0 0 0 10 WEDNESDAY 2 2 0 1 1 0 1 1 0.018801 House / apartment Single / not married Secondary / secondary special 1
100003 reg oper account 0.3333 0.3333 NaN 0.0806 0.08 Block 0.0345 0.0345 block of flats 0.2917 0.2917 No Core staff 0.0 0.0 0.0 0.0 0.0 0.0 Family 0.0 1.0 1.0 0.0 2.0 0 F N N 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 School State servant 0 Cash loans 0 0 0 0 0 11 MONDAY 1 1 0 1 1 0 1 1 0.003541 House / apartment Married Higher education 0
100004 NaN NaN NaN 26.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN Laborers 0.0 0.0 0.0 0.0 0.0 0.0 Unaccompanied 0.0 0.0 0.0 0.0 1.0 0 M Y Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Government Working 0 Revolving loans 0 0 0 0 0 9 MONDAY 2 2 0 1 1 1 1 1 0.010032 House / apartment Single / not married Secondary / secondary special 0
100006 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Laborers NaN NaN NaN NaN NaN NaN Unaccompanied 0.0 2.0 2.0 0.0 2.0 0 F N Y 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Business Entity Type 3 Working 0 Cash loans 0 0 0 0 0 17 WEDNESDAY 2 2 0 0 1 0 1 1 0.008019 House / apartment Civil marriage Secondary / secondary special 0
100007 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Core staff 0.0 0.0 0.0 0.0 0.0 0.0 Unaccompanied 0.0 0.0 0.0 0.0 1.0 1 M N Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Religion Working 1 Cash loans 1 0 0 0 0 11 THURSDAY 2 2 0 0 1 0 1 1 0.028663 House / apartment Single / not married Secondary / secondary special 0
In [44]:
list_cat_vars
Out[44]:
['FONDKAPREMONT_MODE',
 'FLOORSMIN_MODE',
 'FLOORSMIN_MEDI',
 'OWN_CAR_AGE',
 'ELEVATORS_MODE',
 'ELEVATORS_MEDI',
 'WALLSMATERIAL_MODE',
 'ENTRANCES_MEDI',
 'ENTRANCES_MODE',
 'HOUSETYPE_MODE',
 'FLOORSMAX_MEDI',
 'FLOORSMAX_MODE',
 'EMERGENCYSTATE_MODE',
 'OCCUPATION_TYPE',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_YEAR',
 'NAME_TYPE_SUITE',
 'DEF_30_CNT_SOCIAL_CIRCLE',
 'OBS_60_CNT_SOCIAL_CIRCLE',
 'OBS_30_CNT_SOCIAL_CIRCLE',
 'DEF_60_CNT_SOCIAL_CIRCLE',
 'CNT_FAM_MEMBERS',
 'FLAG_DOCUMENT_8',
 'CODE_GENDER',
 'FLAG_OWN_CAR',
 'FLAG_OWN_REALTY',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_21',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'CNT_CHILDREN',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_12',
 'ORGANIZATION_TYPE',
 'NAME_INCOME_TYPE',
 'LIVE_CITY_NOT_WORK_CITY',
 'NAME_CONTRACT_TYPE',
 'REG_CITY_NOT_WORK_CITY',
 'REG_CITY_NOT_LIVE_CITY',
 'LIVE_REGION_NOT_WORK_REGION',
 'REG_REGION_NOT_WORK_REGION',
 'REG_REGION_NOT_LIVE_REGION',
 'HOUR_APPR_PROCESS_START',
 'WEEKDAY_APPR_PROCESS_START',
 'REGION_RATING_CLIENT_W_CITY',
 'REGION_RATING_CLIENT',
 'FLAG_EMAIL',
 'FLAG_PHONE',
 'FLAG_CONT_MOBILE',
 'FLAG_WORK_PHONE',
 'FLAG_EMP_PHONE',
 'FLAG_MOBIL',
 'REGION_POPULATION_RELATIVE',
 'NAME_HOUSING_TYPE',
 'NAME_FAMILY_STATUS',
 'NAME_EDUCATION_TYPE',
 'TARGET']
In [45]:
other
Out[45]:
['COMMONAREA_AVG',
 'COMMONAREA_MODE',
 'COMMONAREA_MEDI',
 'NONLIVINGAPARTMENTS_AVG',
 'NONLIVINGAPARTMENTS_MODE',
 'NONLIVINGAPARTMENTS_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'LIVINGAPARTMENTS_AVG',
 'LIVINGAPARTMENTS_MODE',
 'FLOORSMIN_AVG',
 'YEARS_BUILD_AVG',
 'YEARS_BUILD_MEDI',
 'YEARS_BUILD_MODE',
 'LANDAREA_MEDI',
 'LANDAREA_AVG',
 'LANDAREA_MODE',
 'BASEMENTAREA_MEDI',
 'BASEMENTAREA_AVG',
 'BASEMENTAREA_MODE',
 'EXT_SOURCE_1',
 'NONLIVINGAREA_AVG',
 'NONLIVINGAREA_MODE',
 'NONLIVINGAREA_MEDI',
 'ELEVATORS_AVG',
 'APARTMENTS_AVG',
 'APARTMENTS_MEDI',
 'APARTMENTS_MODE',
 'ENTRANCES_AVG',
 'LIVINGAREA_MEDI',
 'LIVINGAREA_MODE',
 'LIVINGAREA_AVG',
 'FLOORSMAX_AVG',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BEGINEXPLUATATION_AVG',
 'TOTALAREA_MODE',
 'EXT_SOURCE_3',
 'EXT_SOURCE_2',
 'AMT_GOODS_PRICE',
 'AMT_ANNUITY',
 'DAYS_LAST_PHONE_CHANGE',
 'AMT_INCOME_TOTAL',
 'AMT_CREDIT',
 'DAYS_ID_PUBLISH',
 'DAYS_REGISTRATION',
 'DAYS_EMPLOYED',
 'DAYS_BIRTH']

Echamos un vistazo a las listas de variables categóricas y otras.

In [46]:
len(list_cat_vars)
Out[46]:
74
In [47]:
len(other)
Out[47]:
47

Se observa que hemos obtenido 74 variables categóricas y 47 no categóricas.

In [48]:
pd_loan_filter_null[list_cat_vars].dtypes
Out[48]:
FONDKAPREMONT_MODE             category
FLOORSMIN_MODE                 category
FLOORSMIN_MEDI                 category
OWN_CAR_AGE                    category
ELEVATORS_MODE                 category
ELEVATORS_MEDI                 category
WALLSMATERIAL_MODE             category
ENTRANCES_MEDI                 category
ENTRANCES_MODE                 category
HOUSETYPE_MODE                 category
FLOORSMAX_MEDI                 category
FLOORSMAX_MODE                 category
EMERGENCYSTATE_MODE            category
OCCUPATION_TYPE                category
AMT_REQ_CREDIT_BUREAU_WEEK     category
AMT_REQ_CREDIT_BUREAU_HOUR     category
AMT_REQ_CREDIT_BUREAU_MON      category
AMT_REQ_CREDIT_BUREAU_QRT      category
AMT_REQ_CREDIT_BUREAU_DAY      category
AMT_REQ_CREDIT_BUREAU_YEAR     category
NAME_TYPE_SUITE                category
DEF_30_CNT_SOCIAL_CIRCLE       category
OBS_60_CNT_SOCIAL_CIRCLE       category
OBS_30_CNT_SOCIAL_CIRCLE       category
DEF_60_CNT_SOCIAL_CIRCLE       category
CNT_FAM_MEMBERS                category
FLAG_DOCUMENT_8                category
CODE_GENDER                    category
FLAG_OWN_CAR                   category
FLAG_OWN_REALTY                category
FLAG_DOCUMENT_2                category
FLAG_DOCUMENT_3                category
FLAG_DOCUMENT_4                category
FLAG_DOCUMENT_5                category
FLAG_DOCUMENT_6                category
FLAG_DOCUMENT_7                category
FLAG_DOCUMENT_9                category
FLAG_DOCUMENT_21               category
FLAG_DOCUMENT_10               category
FLAG_DOCUMENT_11               category
CNT_CHILDREN                   category
FLAG_DOCUMENT_13               category
FLAG_DOCUMENT_14               category
FLAG_DOCUMENT_15               category
FLAG_DOCUMENT_16               category
FLAG_DOCUMENT_17               category
FLAG_DOCUMENT_18               category
FLAG_DOCUMENT_19               category
FLAG_DOCUMENT_20               category
FLAG_DOCUMENT_12               category
ORGANIZATION_TYPE              category
NAME_INCOME_TYPE               category
LIVE_CITY_NOT_WORK_CITY        category
NAME_CONTRACT_TYPE             category
REG_CITY_NOT_WORK_CITY         category
REG_CITY_NOT_LIVE_CITY         category
LIVE_REGION_NOT_WORK_REGION    category
REG_REGION_NOT_WORK_REGION     category
REG_REGION_NOT_LIVE_REGION     category
HOUR_APPR_PROCESS_START        category
WEEKDAY_APPR_PROCESS_START     category
REGION_RATING_CLIENT_W_CITY    category
REGION_RATING_CLIENT           category
FLAG_EMAIL                     category
FLAG_PHONE                     category
FLAG_CONT_MOBILE               category
FLAG_WORK_PHONE                category
FLAG_EMP_PHONE                 category
FLAG_MOBIL                     category
REGION_POPULATION_RELATIVE     category
NAME_HOUSING_TYPE              category
NAME_FAMILY_STATUS             category
NAME_EDUCATION_TYPE            category
TARGET                         category
dtype: object
In [49]:
income_counts = pd_loan_filter_null['NAME_INCOME_TYPE'].value_counts()
income_percentages = pd_loan_filter_null['NAME_INCOME_TYPE'].value_counts(normalize=True) # Expreso los porcentajes de manera que podamos apreciar mejor los datos ya que en valor absoluto no somos conscientes del valor que pueden tener.

income_summary = pd.DataFrame({
    'Count': income_counts,
    'Percentage': income_percentages.round(2) 
})
income_summary
Out[49]:
Count Percentage
NAME_INCOME_TYPE
Working 158774 0.52
Commercial associate 71617 0.23
Pensioner 55362 0.18
State servant 21703 0.07
Unemployed 22 0.00
Student 18 0.00
Businessman 10 0.00
Maternity leave 5 0.00
In [50]:
plt.figure(figsize=(12, 6))
income_counts.plot(kind='bar', color='skyblue', edgecolor='black')
for i, (count, percentage) in enumerate(zip(income_counts, income_percentages)):
    plt.text(i, count + 1000, f'{percentage:.2f}', ha='center', fontsize=10)
plt.title('Distribución de NAME_INCOME_TYPE', fontsize=14)
plt.ylabel('Frecuencia', fontsize=12)
plt.xlabel('NAME_INCOME_TYPE', fontsize=12)
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image
In [51]:
contract_type_counts = pd_loan_filter_null['NAME_CONTRACT_TYPE'].value_counts()
contract_type_percentage = pd_loan_filter_null['NAME_CONTRACT_TYPE'].value_counts(normalize=True)
contract_type_summary = pd.DataFrame({
    'Count': contract_type_counts,
    'Percentage': contract_type_percentage
})
contract_type_summary
Out[51]:
Count Percentage
NAME_CONTRACT_TYPE
Cash loans 278232 0.904787
Revolving loans 29279 0.095213
In [52]:
plt.figure(figsize=(12, 6))
contract_type_counts.plot(kind='bar', color='orange', edgecolor='black')
for i, (count, percentage) in enumerate(zip(contract_type_counts, contract_type_percentage)):
    plt.text(i, count + 1000, f'{percentage:.2f}', ha='center', fontsize=10)
plt.title('Distribución de NAME_CONTRACT_TYPE', fontsize=14)
plt.ylabel('Frecuencia', fontsize=12)
plt.xlabel('NAME_CONTRACT_TYPE', fontsize=12)
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image

Se observa que la mayoría de los préstamos son otorgados a trabajadores (más del 50%), seguido de comerciales asociados (23%) mientras que grupos como estudiantes, desempleados o empresarios son mucho menos frecuentes.

Por otro lado, los resultados de la celda 27 muestran que la mayoría de los préstamos en el dataset son "Cash loans" (préstamos en efectivo), con 278.232 registros (90%), mientras que los "Revolving loans" (préstamos rotativos) son mucho menos comunes, con 29.279 registros (10%). Esto sugiere que los clientes del dataset suelen preferir préstamos tradicionales con cantidades fijas y plazos definidos, mientras que el uso de créditos flexibles es significativamente menor.

In [53]:
pd_loan_filter_null[other].head(10)
Out[53]:
COMMONAREA_AVG COMMONAREA_MODE COMMONAREA_MEDI NONLIVINGAPARTMENTS_AVG NONLIVINGAPARTMENTS_MODE NONLIVINGAPARTMENTS_MEDI LIVINGAPARTMENTS_MEDI LIVINGAPARTMENTS_AVG LIVINGAPARTMENTS_MODE FLOORSMIN_AVG YEARS_BUILD_AVG YEARS_BUILD_MEDI YEARS_BUILD_MODE LANDAREA_MEDI LANDAREA_AVG LANDAREA_MODE BASEMENTAREA_MEDI BASEMENTAREA_AVG BASEMENTAREA_MODE EXT_SOURCE_1 NONLIVINGAREA_AVG NONLIVINGAREA_MODE NONLIVINGAREA_MEDI ELEVATORS_AVG APARTMENTS_AVG APARTMENTS_MEDI APARTMENTS_MODE ENTRANCES_AVG LIVINGAREA_MEDI LIVINGAREA_MODE LIVINGAREA_AVG FLOORSMAX_AVG YEARS_BEGINEXPLUATATION_MEDI YEARS_BEGINEXPLUATATION_MODE YEARS_BEGINEXPLUATATION_AVG TOTALAREA_MODE EXT_SOURCE_3 EXT_SOURCE_2 AMT_GOODS_PRICE AMT_ANNUITY DAYS_LAST_PHONE_CHANGE AMT_INCOME_TOTAL AMT_CREDIT DAYS_ID_PUBLISH DAYS_REGISTRATION DAYS_EMPLOYED DAYS_BIRTH
SK_ID_CURR
100002 0.0143 0.0144 0.0144 0.0000 0.0 0.0000 0.0205 0.0202 0.022 0.1250 0.6192 0.6243 0.6341 0.0375 0.0369 0.0377 0.0369 0.0369 0.0383 0.083037 0.0000 0.0 0.00 0.00 0.0247 0.0250 0.0252 0.0690 0.0193 0.0198 0.0190 0.0833 0.9722 0.9722 0.9722 0.0149 0.139376 0.262949 351000.0 24700.5 -1134.0 202500.0 406597.5 -2120 -3648.0 -637 -9461
100003 0.0605 0.0497 0.0608 0.0039 0.0 0.0039 0.0787 0.0773 0.079 0.3333 0.7960 0.7987 0.8040 0.0132 0.0130 0.0128 0.0529 0.0529 0.0538 0.311267 0.0098 0.0 0.01 0.08 0.0959 0.0968 0.0924 0.0345 0.0558 0.0554 0.0549 0.2917 0.9851 0.9851 0.9851 0.0714 NaN 0.622246 1129500.0 35698.5 -828.0 270000.0 1293502.5 -291 -1186.0 -1188 -16765
100004 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.729567 0.555912 135000.0 6750.0 -815.0 67500.0 135000.0 -2531 -4260.0 -225 -19046
100006 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.650442 297000.0 29686.5 -617.0 135000.0 312682.5 -2437 -9833.0 -3039 -19005
100007 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.322738 513000.0 21865.5 -1106.0 121500.0 513000.0 -3458 -4311.0 -3038 -19932
100008 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.621226 0.354225 454500.0 27517.5 -2536.0 99000.0 490495.5 -477 -4970.0 -1588 -16941
100009 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.774761 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.492060 0.724000 1395000.0 41301.0 -1562.0 171000.0 1560726.0 -619 -1213.0 -3130 -13778
100010 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.540654 0.714279 1530000.0 42075.0 -1070.0 360000.0 1530000.0 -2379 -4597.0 -449 -18850
100011 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.587334 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.751724 0.205747 913500.0 33826.5 0.0 112500.0 1019610.0 -3514 -7427.0 365243 -20099
100012 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.746644 405000.0 20250.0 -1673.0 135000.0 405000.0 -3992 -14437.0 -2019 -14469
In [54]:
print(pd_loan_filter_null["OCCUPATION_TYPE"].value_counts().count())
pd_loan_filter_null["OCCUPATION_TYPE"]\
    .apply(lambda x: str(x).lower().strip()).value_counts(normalize=True)
18
Out[54]:
OCCUPATION_TYPE
laborers                 0.261396
sales staff              0.152056
core staff               0.130589
managers                 0.101227
drivers                  0.088116
high skill tech staff    0.053903
accountants              0.046481
medicine staff           0.040437
security staff           0.031835
cooking staff            0.028164
cleaning staff           0.022040
private service staff    0.012562
low-skill laborers       0.009914
waiters/barmen staff     0.006385
secretaries              0.006181
realty agents            0.003557
hr staff                 0.002667
it staff                 0.002491
Name: proportion, dtype: float64
In [55]:
top_occupation = pd_loan_filter_null['OCCUPATION_TYPE'].value_counts().head(10)
top_occupation.plot(kind='bar')
Out[55]:
<Axes: xlabel='OCCUPATION_TYPE'>
No description has been provided for this image

La tabla muestra la distribución proporcional de los diferentes tipos de ocupaciones en el dataset. Los "laborers" son el grupo más representado, con un 26.14% del total, seguidos por "sales staff" (15.26%) y "core staff" (13.05%). En contraste, ocupaciones como "IT staff" (0.25%) y "HR staff" (0.35%) tienen la menor representación.

Exportación de la tabla¶

In [56]:
pd_loan_filter_null.to_csv('./pd_data_initial_preprocessing.csv')